Method of compressing data packets

ABSTRACT

A method of compressing data, wherein said data is in the form of discrete units, by determining an overall code specific to the units, comprising the steps of: 
     a) grouping the units in terms of a common behavior pattern; 
     b) for each said group of units, assigning a group specific code, the sizes of said group code being ordered according to the probability of the unit occurring; 
     c) assigning a unit identifier code which is specific to identify an individual character within the group, 
     the overall code comprising the concatenated group specific and identifier codes, characterized in that, in 
     step c), the identifier code is of the minimum size to allow for each unit which could occur in that group to be assigned specifically.

This application is a 371 of PCT/EP01/04607, filed Apr. 24, 2001, and is related to U.S. patent application Ser. No. 10/381,718, filed Oct. 10, 2003.

BACKGROUND OF THE INVENTION

Many digital communications systems send data in packets. These packets contain headers at the start of the data. The header comprises information relating, e.g., to the destination address of the packet, the length of the packet and the type of payload data contained inside. The header can be considered a long character comprising a string of bits.

Mobile telecom networks and the Internet are converging in terms of their functionality. It is desirable for third generation mobile handsets to understand Internet (IP or ATM) data packets directly to allow for seamless email, web browsing and multimedia services to the mobile user. Protocols such as IP are designed to run on fixed networks where bandwidth is plentiful, and so they are costly in the mobile phone environment. When used to carry speech, the overhead resulting from using IP can be up to 75% of the total network capacity, which is unacceptable for mobile networks.

One solution to this problem is to compress the IP header just before it crosses the air interface. A number of compression schemes exist for this purpose (Van Jacobson, CRTP etc.), which variously make trade-offs between efficiency, flexibility and simplicity.

Known data compression systems include the Huffman Algorithm, which is discussed in detail in the co-pending application. This publicly available standard is widely used in many compression schemes including “WinZip”. Huffman encoding compresses a data stream one character at a time, where a character is usually one byte. The basic compression is not very efficient, but it is possible to obtain better results by applying the method recursively or by increasing the size of one character. However, this increases the processing and/or memory requirements of the algorithm.

In order to understand the invention the prior art will now be explained.

Ordinary Huffman

Huffman encoding is a publicly available compression standard used in many popular compression schemes such as “WinZip”. All Huffman compressors work on a stream of characters (for example ASCII characters). The basic idea is to create a new set of compressed characters or codes, where each normal character maps onto a compressed character and vice versa. Frequently occurring, i.e. common, characters are given shorter compressed codes than rarely used characters, reducing the average size of the data stream. The compression ratio can be improved by increasing the size of one character, but at the expense of higher memory requirements. In fact the memory used when running a Huffman compressor grows exponentially with the character size, so 16-bit characters need 256 times as much memory as 8-bit characters.

FIG. 1 illustrates how ordinary Huffman works. The example relates to 10 different possible characters (a set of 10 ASCII characters) as shown in single inverted commas (in general a character can be anything, e.g. a byte, a header, an ASCII character etc). A prerequisite is to know, for each character, the approximate probability of that character turning up in the data sequence; the skilled person would understand that this can be done in any appropriate way (e.g. a large stream of characters is taken and one determines how often each character appears).

In the worked example the ordinary Huffman tree needs 10 starting nodes, one for each possible character. These nodes are plotted at the top of the Huffman tree, together with the percentage chance that the character turns up in an uncompressed data-stream. The characters are ordered generally in terms of increasing probability. The space character is a very common character and is put last. As shown in the figure, the box underneath each character shows the probability of occurrence. To build the tree, the two nodes with smallest probabilities are joined up to form a new node. The left-hand branch is labelled with a “1” and the right-hand branch with a “0”. The new node is given the combined probability of the two joined nodes (in the first case this is 6%). This process continues until there is only one node left, at which point the tree is finished. In general, the branch with smallest probability is labelled with a ‘1’, and the second smallest with a ‘0’. The sum of these two probabilities is placed in the new node. The completed Huffman tree for the worked example is shown in FIG. 1.

To compress a character one starts at the correct node and follows the tree down, reading off the ‘1’s and ‘0’s as they occur. The string of bits that this generates is the compressed character. E.g. start at “E” and follow the tree down to its root; this gives 0001. Thus E is represented by 0001.

The compressed character is sometimes stored backwards, so E is represented by 1000. This makes it easier to decompress (because we can follow the tree up by reading the compressed character from left to right).

Similarly, to decompress a character just follow the tree up using the compressed string of bits to decide whether to branch left or right at each node. Eventually one of the original ten nodes is reached and the correct decompressed character is discovered.

As can be seen, common characters are represented by fewer bits; a “space” character is represented here by a 0.
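The construction described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation of FIG. 1: the function name `huffman_codes` is hypothetical, the probabilities are assumed to be those listed in the table of Example 1 (in percent), and ties between equal probabilities are broken by input order, which is an assumption. Codes are returned in the reversed (root-first) form that is convenient for decompression.

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Map each character to its bit string, written root-first."""
    tie = count()  # tie-breaker so that equal probabilities never compare the dicts
    heap = [(p, next(tie), {ch: ''}) for ch, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, smallest = heapq.heappop(heap)  # smallest probability: branch '1'
        p0, _, second = heapq.heappop(heap)    # second smallest: branch '0'
        merged = {ch: '1' + code for ch, code in smallest.items()}
        merged.update({ch: '0' + code for ch, code in second.items()})
        heapq.heappush(heap, (p1 + p0, next(tie), merged))
    return heap[0][2]

# Probabilities in percent, assumed from the table of Example 1.
PROBS = {' ': 54, 'A': 8, 'E': 8, '?': 8, '$': 5, 'D': 5,
         '1': 3, '2': 3, 'B': 3, 'C': 3}
CODES = huffman_codes(PROBS)
AVERAGE_BITS = sum(PROBS[ch] * len(code) for ch, code in CODES.items()) / 100
```

With these assumed inputs the space character receives the single bit 0 and “E” receives 1000, matching the text, and the average code length is 2.42 bits per character.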

Improved Huffman

In a well-known enhanced method of compressing a stream of characters based on Huffman, each character is assigned a group and it is the groups which are treated as characters of the conventional Huffman algorithm. The method has significantly lower memory requirements than ordinary Huffman, allowing the size of one character to be increased and hence giving a better compression ratio. The improved Huffman method also uses a “character group” rather than the characters themselves to build a tree; the groups effectively become the characters of the ordinary Huffman.

The improved Huffman tree is constructed in two stages. In the first stage the characters are divided up into groups according to a common behaviour pattern. A behaviour pattern may e.g. be the same probability, so characters are grouped according to their relative frequency.

The problem however is that in a compressed character, the Huffman code for the group must be followed by a bit pattern identifying which character within the group has been compressed. If the number of characters in the group is not a power of two then bit patterns are wasted, giving poorer compression efficiency. The inventors have determined a method which overcomes these problems.

It is an object of the invention to provide an improved method of compression and subsequent decompression of headers and characters of binary (or other) data units.

The inventor has determined an improved method of compression of digital data which makes use of detecting behaviour patterns in successive data blocks, which allows for efficient data compression. Behaviour patterns are defined as any form of non-randomness and may take any appropriate form, e.g. repeats, counters where the counter is incremented by 1, or where data blocks alternate between a small number of values.

The inventor has developed an improved version of the Huffman method which has significantly lower memory requirements than ordinary Huffman, allowing the size of one character to be increased and hence giving a better compression ratio.

The invention comprises a method of compressing data, wherein said data is in the form of discrete units, by determining an overall code specific to the units, comprising the steps of:

a) grouping the units in terms of a common behaviour pattern;

b) for each said group of units, assigning a group specific code, the sizes of said group code being ordered according to the probability of the unit occurring;

c) assigning a unit identifier code which is specific to identify an individual character within the group,

the overall code comprising the concatenated group specific and identifier codes, characterised in that, in

step c), the identifier code is of the minimum size to allow for each unit which could occur in that group to be assigned specifically.

The invention will now be described in more detail with reference to examples.

Other objects, advantages and novel features of the present invention will become apparent from the following detailed description of the invention when considered in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram that illustrates the known ordinary Huffman compression;

FIG. 2 is a diagrammatic depiction of data compression according to the invention; and

FIG. 3 illustrates an example of the invention and its operational advantages.

DETAILED DESCRIPTION OF THE DRAWINGS

The invention will now be described in detail with reference to the following examples.

EXAMPLE 1

A simple basic example of the invention will now be described with reference to FIG. 2. In the worked example there are 4 groups or behaviour patterns. Group A contains all the characters that turn up with 3% probability, namely ‘B’, ‘C’, ‘1’ and ‘2’, and the other groups are set up depending on the probability of encountering the characters: Group B contains the characters ‘A’, ‘E’ and ‘?’, Group C includes ‘D’ and ‘$’, and finally Group D contains just the SPACE character.

Character   Chance of occurring   Group   Identifier
SPACE       54%                   A
A           8%                    B
E           8%                    B
?           8%                    B
$           5%                    C
D           5%                    C
1           3%                    D       00
2           3%                    D       01
B           3%                    D       10
C           3%                    D       11

The tree initially starts with one node for each group (4 in total). The nodes are labelled with the number of characters in the group and the probability that one character in the group occurs in a random stream of characters.

To build the tree, there are three possible operations. Where there is an even number of characters in the group, put a node; this node is assigned double the probability, but the counter for the new node shows half the number of characters. The node is assigned a variable “X” of characters, which is filled in later depending on which character in the group is chosen to be compressed. Each time one moves further towards the root of the tree a new node is created; the probability is doubled and the number of elements in the counter is halved. When the tree is used to compress data, the “X”s are filled in depending upon which character turns up. Rather than having multiple branching at the top of the tree one has a single track and a small array for an identifier.

E.g., in order to decompress this data, where the character code is 101011: 1, go to left; 0, go to left; 1, go to left. One then knows it is B, C, 1 or 2; the last 2 bits tell you which character it is.

Effectively the compressed code comprises two portions; one portion comprises the code which identifies the group, the group code. Again, as with Huffman, groups which e.g. contain characters which turn up very rarely have longer group codes than those groups with common characters. The other portion of the compressed code comprises the identifier code, which is the code which distinguishes the character from other characters within the group. Groups with an odd number of characters are split into two groups; one character is removed to produce a new group having an even number of characters and a new group containing just one, the removed character.

If there is an odd number of characters in a group, the group is split up into two nodes. One branch represents just one of the characters; the other represents all the other characters and now represents a set having an even number of characters. The improved Huffman trees, at nodes where there is no branching, effectively contain an extra symbol ‘X’ to act as an identifier. Where there is branching from a group having an odd number of members there is an identifier “D” which is either 0 or 1 to indicate which branch is which after the aforementioned splitting, i.e. if the value of “D” is 1 this may represent the branch which represents the character which was removed from the group to provide an even numbered group, and a “0” the new even-numbered group.

The ‘D’ symbol is used to split the group up into two new groups. Since each new group has its own group identifier, there is no need to assign 0's and 1's to the ‘D’ symbol.

The ‘X’ identifiers in the original and new even groups identify the character within the even group.

As mentioned, the inventor has determined that to optimise efficiency, one can split one node into two nodes, which is indicated using a single digit identifier. In this specification, we refer to this as “D”.

The “X”'s and “D”'s are in effect digits of the identifying code and serve to distinguish between any two characters with the same behaviour pattern. The initial step of compression is to label every character with a unique identification number that distinguishes it from other characters with the same behaviour pattern.

The general method of creating a tree for the improved Huffman algorithm is as follows:

Search for the node with the smallest probability. Suppose that this node contains n characters. The next step depends on the value of n:

1) If n is even then create a new node with double the probability but half the number of characters n. Join this new node to the old one, and label the branch with an ‘X’.

2) If n is odd and n>1 then create two new nodes with the same probability, the one on the left containing n−1 characters and the one on the right containing 1 character. Join these new nodes to the old node, labelling the branches with a ‘D’.

3) If n=1 then search for the node with the second-smallest probability. Suppose that this node contains m characters.

a) If m>1 then create two new nodes with the same probability, one containing m−1 characters and the other containing 1 character. Join these new nodes to the old node, labelling the branches with a ‘D’.

b) There is now a node with smallest probability and a node with second-smallest probability, both containing one character. Join these nodes to form a new node containing one character. Label the branch with smallest probability using a ‘1’ and the second-smallest using a ‘0’. Place the sum of the two probabilities in the new node.
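The three rules above can be traced with a short Python sketch that tracks only the total code length (group code plus identifier) each character ends up with. It is a hedged illustration, not the patented implementation: the function name `improved_huffman_lengths`, the descriptive group names, and the tie-break (equal probabilities favour the node with more characters) are assumptions; the probabilities, in percent, are taken from the table of Example 1.

```python
def improved_huffman_lengths(groups):
    """groups maps a name to (probability per character, number of characters).
    Returns a list of (group name, characters covered, total code bits)."""
    # An active node is [probability, n, entries]; an entry is
    # [group name, characters covered, bits used so far].
    nodes = [[p, n, [[name, n, 0]]] for name, (p, n) in groups.items()]

    def pop_smallest():
        # Assumed tie-break: equal probabilities favour the larger node.
        idx = min(range(len(nodes)), key=lambda k: (nodes[k][0], -nodes[k][1]))
        return nodes.pop(idx)

    def split(node):
        # 'D' branches: same probability, n-1 characters and 1 character.
        p, n, entries = node
        name, _, bits = entries[0]
        return [p, n - 1, [[name, n - 1, bits]]], [p, 1, [[name, 1, bits]]]

    while len(nodes) > 1 or nodes[0][1] > 1:
        node = pop_smallest()
        p, n, entries = node
        if n % 2 == 0:                  # rule 1: 'X' branch costs one bit
            entries[0][2] += 1
            nodes.append([2 * p, n // 2, entries])
        elif n > 1:                     # rule 2: 'D' branches cost no bits
            nodes.extend(split(node))
        else:                           # rule 3: n == 1
            other = pop_smallest()
            if other[1] > 1:            # rule 3a
                rest, other = split(other)
                nodes.append(rest)
            merged = entries + other[2]  # rule 3b: '1'/'0' branches, one bit each
            for entry in merged:
                entry[2] += 1
            nodes.append([p + other[0], 1, merged])
    return [tuple(entry) for entry in nodes[0][2]]

EXAMPLE_1 = {'space': (54, 1), '8%': (8, 3), '5%': (5, 2), '3%': (3, 4)}
LENGTHS = improved_huffman_lengths(EXAMPLE_1)
```

Under these assumptions the four 3% characters receive 5 bits each, the two 5% characters 4 bits, the three 8% characters 4, 4 and 3 bits, and the space 1 bit. No bit patterns are wasted, even though the 8% group does not hold a power of two characters.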

For compression and decompression, each character in a group should be labelled with a unique identification number from the set {0, 1, 2, . . .}. This serves to distinguish between two characters in the same group.

Suppose that the character to be compressed has unique identifier i. Find the correct behaviour pattern on the tree and follow the tree down, taking these steps at each node:

1) If the node has a ‘0’ or ‘1’ branch then add this bit to the string of compressed bits.

2) If the label is ‘X’ then add the least significant bit of i to the string of compressed bits. Then divide i by 2 (rounded down).

3) If the label is ‘D’ then, if i is 0, follow the branch to the right. Otherwise decrease i by 1 and follow the branch to the left.

The resulting string of bits is the compressed character. Decompression is simply a matter of reversing this process, using the compressed bits to follow the tree back up to the correct behaviour pattern. The unique identifier i should be initially set to 0, and is reconstructed by taking the following steps at each node:

1) If the node branches then use the corresponding bit in the compressed string to determine which branch to follow.

2) If an ‘X’ is reached then multiply i by 2, and then increase i by 1 if the corresponding compressed bit is also ‘1’.

3) If a left-hand ‘D’ branch is reached then increase i by 1.
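The compression and decompression steps above can be exercised end to end with the following Python sketch. It is one possible reading of the text, not a definitive implementation: the names `Node`, `build`, `compress` and `decompress`, the group names, and the tie-break between equal probabilities (the larger node is taken first) are all assumptions, and the probabilities are those of the table of Example 1 in percent. Compressed strings are stored root-first (the “backwards” form) so that decompression reads them left to right.

```python
class Node:
    def __init__(self, prob, n, group=None):
        self.prob, self.n, self.group = prob, n, group
        self.down = None  # step towards the root
        self.up = None    # step back towards the original group

def build(groups):
    """groups maps a name to (probability per character, number of characters)."""
    leaves = {name: Node(p, n, name) for name, (p, n) in groups.items()}
    active = list(leaves.values())

    def pop_smallest():
        # Assumed tie-break: equal probabilities favour the larger node.
        node = min(active, key=lambda nd: (nd.prob, -nd.n))
        active.remove(node)
        return node

    def split(node):  # 'D' branches: n-1 characters on the left, 1 on the right
        left, right = Node(node.prob, node.n - 1), Node(node.prob, 1)
        node.down = ('D', left, right)
        left.up, right.up = ('DL', node), ('DR', node)
        return left, right

    while len(active) > 1 or active[0].n > 1:
        node = pop_smallest()
        if node.n % 2 == 0:                      # 'X' branch
            new = Node(2 * node.prob, node.n // 2)
            node.down, new.up = ('X', new), ('X', node)
            active.append(new)
        elif node.n > 1:                         # 'D' branches
            active.extend(split(node))
        else:                                    # join two one-character nodes
            other = pop_smallest()
            if other.n > 1:
                rest, other = split(other)
                active.append(rest)
            new = Node(node.prob + other.prob, 1)
            node.down, other.down = ('bit', '1', new), ('bit', '0', new)
            new.up = ('join', node, other)
            active.append(new)
    return leaves, active[0]

def compress(leaves, group, i):
    node, bits = leaves[group], []
    while node.down:
        kind, *rest = node.down
        if kind == 'bit':
            bits.append(rest[0])
            node = rest[1]
        elif kind == 'X':
            bits.append(str(i & 1))  # least significant bit of i
            i >>= 1
            node = rest[0]
        elif i == 0:                 # 'D' with identifier 0: right-hand branch
            node = rest[1]
        else:                        # 'D' otherwise: left-hand branch
            i -= 1
            node = rest[0]
    return ''.join(reversed(bits))   # root-first, ready for decompression

def decompress(root, bits):
    node, i, pos = root, 0, 0
    while node.up:
        kind, *rest = node.up
        if kind == 'join':
            node = rest[0] if bits[pos] == '1' else rest[1]
            pos += 1
        elif kind == 'X':
            i = 2 * i + int(bits[pos])
            pos += 1
            node = rest[0]
        else:
            if kind == 'DL':         # left-hand 'D' branch
                i += 1
            node = rest[0]
    return node.group, i

GROUPS = {'space': (54, 1), '8%': (8, 3), '5%': (5, 2), '3%': (3, 4)}
LEAVES, ROOT = build(GROUPS)
CODES = {(g, i): compress(LEAVES, g, i)
         for g, (_, n) in GROUPS.items() for i in range(n)}
```

With these assumptions every character of the 3% group is compressed to the group code 101 followed by a two-bit identifier, the space is compressed to 0, and `decompress(ROOT, code)` recovers the group and identifier for every code in `CODES`.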

As can be seen, the difference between the two methods is that ordinary Huffman encoding needs a separate node for every character.

Application to Headers

In a particular embodiment of the invention an entire header is treated as a single character.

For 320 bit headers, this means there can be a possible 2^320 different headers. It would be impossible to implement the ordinary Huffman tree in this case. However, using the improved Huffman according to the invention, all possible headers can be divided up into a substantially reduced number of groups, each group containing headers which have a common behaviour pattern. For example they may be grouped into say 1000 groups, each different group having headers which turn up with the same probability; thus there are 1000 different probabilities that any header might arise. There may be (especially for the less common headers) many millions of headers within each probability group. When implementing the improved Huffman method for headers, one groups headers by probability and these groups are treated as Huffman characters.

EXAMPLE 2

FIG. 3 shows an example and illustrates the invention and how its operational advantages arise. A header comprises 320 bits, and can be irregular, i.e. there are 2^320 possible headers. These are grouped into say 4 behaviour patterns: those which occur with probability A % (say containing 4 headers), those with B % (containing 4 headers), those which occur with C % (say 10 headers) and those which occur with D % (say this group is irregular and contains the remaining headers, i.e. 2^320 − 18). The probabilities A to D are in decreasing order. The improved Huffman tree is constructed as in FIG. 3. The group A is represented by a 1, the group B by 01, the group C by 001 and group D by 000. These are the first portion of the compressed headers and effectively are the Huffman codes for the groups. The second portion is the identifier which distinguishes the header from other headers in the same group. In the first group A there are 4 headers, so one only needs an identifier register having 2 bits, which gives 4 possibilities. The compressed header thus, for a member of group A, comprises three bits in total: “1” and then the identifier. For group D there are still very many different possibilities of headers, 2^320 − 18. In this case it is impossible to have an identifier look-up table for each and the header itself becomes the identifier. Thus the complete compressed header becomes 000 tagged onto the header itself. Although the compressed header in this case is actually longer by 3 bits, using this system savings are made because most of the headers which will be encountered are not irregular and thus get compressed with shorter group codes, as in Huffman, but substantially shorter identifiers.

As a result the tree does not have to be constructed with a node to distinguish each and every possible header, which would otherwise become unfeasibly large.

The example describes a simple case where headers are divided into 4 groups. This is an extreme case to illustrate the invention and the number of groups would be selected in an appropriate fashion for optimal efficiency. Typically there may be 1000 groups, but this is still more manageable than 2^320, which is impossible for a computer to deal with. In general, with Huffman, the number of characters is a measure of the processing time.
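The sizes quoted in Example 2 can be checked with a few lines of Python. This is a sketch under stated assumptions: the helper name `identifier_bits` is hypothetical, and it models the identifier as a fixed-width register just wide enough to number every header in the group, which is an upper bound for groups such as C whose size is not a power of two.

```python
def identifier_bits(n):
    """Smallest fixed width, in bits, that can number n distinct headers."""
    return (n - 1).bit_length()

HEADER_BITS = 320
# group -> (group code from Example 2, number of headers in the group)
HEADER_GROUPS = {
    'A': ('1', 4),
    'B': ('01', 4),
    'C': ('001', 10),
    'D': ('000', 2**HEADER_BITS - 18),  # the irregular remainder
}
SIZES = {g: len(code) + identifier_bits(n) for g, (code, n) in HEADER_GROUPS.items()}
```

Here `SIZES` comes out as 3 bits for group A, 4 for group B, 7 for group C and 323 for group D, i.e. an irregular header costs its own 320 bits plus the 3-bit group code.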

In yet a further preferred embodiment use is made of the fact that the header is divided up into a number of fields. These fields tend to behave independently from each other.

EXAMPLE 3

In a simple example of a preferred embodiment, one or more of the sub-units may be compressed according to the above described improved Huffman method. The compressed code for the complete header would comprise the concatenated compressed codes for the sub-units and, for any sub-unit which is not compressed, the sub-unit itself.

For example, if a header comprises two sub-units, the first sub-unit comprising an address and the second sub-unit a random data string, the overall compressed header will comprise the compressed address field concatenated with the random data string.

Note that the fields in the compressed header do not have to occur in the same order as the fields in the uncompressed header.

The table below shows a typical header; this may contain fields such as IP header, version number, header length, reserved flag, source address, destination address, type of service. Each field has a number of different possible behaviours: e.g. the address field is a 32 bit field which indicates the destination of the data packet; very often it is the same as for the last header, e.g. the data packets go to the mobile phone, and only on rare occasions might it have switched to a different phone. Possible behaviours are, e.g.: static (same as the field value in the previous header); alternating (alternating between a small number of values); irregular (completely random, cannot be compressed); and inferred (the value can be worked out from a different field, e.g. a length field, since how long the header is can be worked out from all the other fields). This is shown in the worked example below, which includes a table of a header comprising 3 fields:

Field 1   Field 2   Field 3
S/IN      S/I/A     S

Each field may have one or more different behaviours. In the example the first field can have two different types of behaviour, STATIC or INFERRED; the second field has three possible different types of behaviour, STATIC, IRREGULAR or ALTERNATING; and the third field only one, STATIC. This is shown in column 2 of the table below.

EXAMPLE 4

A preferred, more complex embodiment will now be described. It is applicable to headers which have fields, wherein one or more fields can have different types of behaviour. The first step for each field is to assign one or more different behaviour types. The next step is to determine, for each field, the probability that it will have a particular behaviour type, e.g. for each field determining the probability that it will be STATIC or INFERRED etc. This is shown in the last column of the table below.

In order to determine the number of different groups applied to the improved Huffman method, i.e. the number of different overall header behaviours, one multiplies out the number of different field behaviours for each field. In the example, a field behaviour is picked from field 1, a behaviour from field 2 and a behaviour from field 3. This is then repeated for each combination of field behaviours. This is shown in the second of the tables below. Additionally the probability of each combination is determined as shown in the last column of the second table. The Huffman tree is then arranged such that the groups at the top of the Huffman tree, which are the particular combinations of header types, are arranged such that those with the smallest probability have the most branching and thus the longest group code, and those with the largest probability of occurring have the least branching and the shortest group code.

In a further refinement of this embodiment it is advantageous to keep a register of, for each behaviour pattern, how many ways it can vary. There is, for example, only one way a field can be static, i.e. it doesn't change; for an 8 bit field there are 256 ways of being irregular, and perhaps 4 ways of alternating. Knowledge of this allows the size of the identifier code to be determined. Where the field is irregular and there are 256 ways of being irregular, the identifier code register needs to be 8 bits in size and the identifier code would comprise the field itself. Where the field is alternating between 4 values the identifier register needs to be only two bits in size, which allows four different combinations.

          Behaviour   Behaviour     Number of field values   Probability that field
          Number      Type          exhibiting behaviour     will exhibit behaviour
Field 1   1           STATIC        1                        80%
          2           INFERRED      1                        20%
Field 2   1           STATIC        1                        80%
          2           ALTERNATING   4                        15%
          3           IRREGULAR     256                      5%
Field 3   1           STATIC        1                        100%

Possibility   1st field   2nd field   3rd field   Probability of combination
1             S           S           S           0.8 × 0.8 × 1
2             S           I           S           0.8 × 0.05 × 1
3             S           A           S           0.8 × 0.15 × 1
4             N           S           S           0.2 × 0.8 × 1
5             N           I           S           0.2 × 0.05 × 1
6             N           A           S           0.2 × 0.15 × 1
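The multiplication of field behaviours into header groups can be reproduced with a short Python sketch. The variable names are hypothetical and the identifier width is modelled as the smallest whole number of bits that can number every variant, which is an assumption; the behaviour counts and probabilities are taken from the two tables above (S = STATIC, N = INFERRED, I = IRREGULAR, A = ALTERNATING).

```python
from fractions import Fraction
from itertools import product
from math import prod

# Per field: behaviour -> (number of field values, probability of the behaviour)
FIELDS = [
    {'S': (1, Fraction(80, 100)), 'N': (1, Fraction(20, 100))},
    {'S': (1, Fraction(80, 100)), 'I': (256, Fraction(5, 100)), 'A': (4, Fraction(15, 100))},
    {'S': (1, Fraction(1))},
]

COMBINATIONS = []
for combo in product(*(field.items() for field in FIELDS)):
    name = ''.join(behaviour for behaviour, _ in combo)
    ways = prod(values for _, (values, _prob) in combo)   # variants within the group
    probability = prod(prob for _, (_values, prob) in combo)
    id_bits = (ways - 1).bit_length()                     # identifier width in bits
    COMBINATIONS.append((name, probability, ways, id_bits))
```

The six entries appear in the same order as the second table, their probabilities sum to 1, and the identifier widths are 0 bits for the all-static cases, 2 bits where field 2 alternates and 8 bits where it is irregular. Each entry would then serve as one group for the improved Huffman tree.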

Various methods may be employed in order to arrange for the compression to take place. In a static mode, the scheme is manually programmed to compress one protocol stack with optimal efficiency. The input is a list of the fields within the protocol stack, and for each field a set of possible ways in which the field can behave. This input list is converted into a form suitable for the improved Huffman method, as explained above. The method calculates the best format for the compressed headers and stores the results in the form of a Huffman tree. Alternatively the compressor and decompressor are programmed by sending a special profiling message containing the list of field behaviour patterns. This message is usually sent by the network administrator whenever a new protocol stack needs to be added to the system. A further alternative is a “learning mode” which scans an arbitrary packet stream and dynamically learns how best to compress the packet headers. The efficiency of this mode depends on the number of behaviour types that can be detected. A further preferred embodiment is a “Hybrid Mode” where the system is pre-programmed to compress a protocol stack just as in Static Mode, but Learning Mode is also activated in case the protocol stack can be even more efficiently compressed. This mode is especially useful for coping with unexpected changes in the way the protocol stack behaves.

Generating Byte-Aligned Headers

Certain link layer protocols require all data packets to be a whole number of bytes long. Since the payload is already in this form, this means that the compressed headers must also be byte-aligned. Byte-aligned headers can be generated in two ways. The simplest is to stuff each header with enough bits to round the header up to a whole number of bytes. These extra bits can then be used for a CRC checksum or sequence number to protect against lost packets. However, this method is inefficient and tends to randomise the amount of error checking from one header to the next. A better alternative is to always generate byte-aligned headers in the first place. In ordinary Huffman this can be achieved by recursively joining the 256 nodes with smallest probabilities (instead of the 2 nodes with smallest probabilities). Each of the 256 branches is labelled with a different 8-bit pattern. The improved Huffman algorithm can also be modified to generate byte-aligned headers in a similar manner.
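The 256-way joining for ordinary Huffman can be sketched as follows. This is an illustration only: the function name `byte_aligned_lengths` is hypothetical, and the padding with zero-probability dummy nodes (so that every join takes exactly 256 nodes) is a standard detail of n-ary Huffman coding that is assumed here rather than taken from the text. The function returns the code length of each symbol in whole bytes.

```python
import heapq
from itertools import count

def byte_aligned_lengths(probs, arity=256):
    """Code length in bytes per symbol when `arity` nodes are joined at a time."""
    tie = count()  # tie-breaker so equal probabilities never compare the lists
    heap = [(p, next(tie), [sym]) for sym, p in probs.items()]
    # Assumed detail: pad with zero-probability dummies so each join is full.
    while (len(heap) - 1) % (arity - 1) != 0:
        heap.append((0, next(tie), []))
    heapq.heapify(heap)
    lengths = dict.fromkeys(probs, 0)
    while len(heap) > 1:
        total, members = 0, []
        for _ in range(arity):          # join the `arity` least probable nodes
            p, _tie, syms = heapq.heappop(heap)
            total += p
            members += syms
        for sym in members:             # each join adds one 8-bit branch label
            lengths[sym] += 1
        heapq.heappush(heap, (total, next(tie), members))
    return lengths

# 300 equally likely symbols: too many for one byte, so some need two.
LENGTHS_300 = byte_aligned_lengths({sym: 1 for sym in range(300)})
```

For the 300 equally likely symbols used here, 255 symbols receive a one-byte code and the remaining 45 share the last one-byte pattern as a prefix and receive two-byte codes.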

The system can also easily handle variable-length fields. In fact, it is simply a matter of adding one behaviour pattern for each possible length of the variable-length field. Note that this encoding implicitly includes the length of the field, so if there is a separate field containing the length value then it should be classed as INFERRED to avoid transmitting the information twice.

The foregoing disclosure has been set forth merely to illustrate the invention and is not intended to be limiting. Since modifications of the disclosed embodiments incorporating the spirit and substance of the invention may occur to persons skilled in the art, the invention should be construed to include everything within the scope of the appended claims and equivalents thereof.

1. A method of compressing data, wherein said data is in the form of discrete units, by determining an overall code specific to the units, comprising the steps of: a) grouping the units in terms of a common behavior pattern; b) for each said group of units, assigning a group specific code, the sizes of said group code being ordered according to the probability of the unit occurring; c) assigning a unit identifier code which is specific to identify an individual character within the group, the overall code comprising the concatenated group specific and identifier codes; characterized wherein, step c), the identifier code is of the minimum size to allow for each unit which could occur in that group to be assigned specifically; said discrete units are divided into a number of sub-units; at least one sub-unit is treated and compressed; and the overall compressed code comprises a concatenation of any compressed sub-unit codes and any uncompressed sub-units themselves.

2. A method as claimed in claim 1, wherein the identifier code comprises the data unit itself.

3. A method of compressing data, wherein said data is in the form of discrete units, by determining an overall code specific to the units, comprising the steps of: a) grouping the units in terms of a common behavior pattern; b) for each said group of units, assigning a group specific code, the sizes of said group code being ordered according to the probability of the unit occurring; c) assigning a unit identifier code which is specific to identify an individual character within the group, the overall code comprising the concatenated group specific and identifier codes; characterized wherein, step c), the identifier code is of the minimum size to allow for each unit which could occur in that group to be assigned specifically; at least one sub-unit is assigned a plurality of behavior types; and said groups of units are grouped in terms of having the combination of sub-unit behavior types.

4. A method as claimed in claim 3, wherein the sub-units are grouped according to particular probability ranges that the particular sub-unit will occur.

5. A method according to claim 3, wherein the sub-units are grouped according to at least one of the following types of behavior: static, alternating, inferred or irregular.

6. A method according to claim 3, wherein the number of possible sub-units which could occur in the sub-unit group is determined in order to determine the size of a register for the sub-unit code.

7. A method as claimed in claim 3, wherein the probability of each combination of sub-unit behavior types is determined, each combination forming a separate group.

8. A method as claimed in claim 3, wherein said data unit is a header.